Abstract

We introduce single-set spectral sparsification as a deterministic sampling-based feature
selection technique for regularized least squares classification, which is the classification
analogue to ridge regression. The method is unsupervised and gives worst-case
guarantees on the generalization power of the classification function after feature selection
with respect to the classification function obtained using all features. We also introduce
leverage-score sampling as an unsupervised randomized feature selection method
for ridge regression. We provide risk bounds for both single-set spectral sparsification
and leverage-score sampling on ridge regression in the fixed design setting and show
that the risk in the sampled space is comparable to the risk in the full-feature space. We
perform experiments on synthetic and real-world datasets, namely a subset of the TechTC-300
datasets, to support our theory. Experimental results indicate that the proposed
methods perform better than the existing feature selection methods.

1 Introduction 

Ridge regression is a popular technique in machine learning and statistics. It is a commonly
used penalized regression method. Regularized Least Squares Classifier (RLSC)
is a simple classifier based on least squares and has a long history in machine learning
(Zhang and Peng, 2004; Poggio and Smale, 2003; Rifkin et al., 2003; Fung and
Mangasarian, 2001; Suykens and Vandewalle, 1999; Zhang and Oles, 2001; Agarwal,
2002). RLSC is also the classification analogue to ridge regression. RLSC has been
known to perform comparably to the popular Support Vector Machines (SVM) (Rifkin
et al., 2003; Fung and Mangasarian, 2001; Suykens and Vandewalle, 1999; Zhang and
Oles, 2001). RLSC can be solved by simple vector space operations and does not require
quadratic optimization techniques like SVM.

We propose a deterministic feature selection technique for RLSC with provable guarantees.
There exist numerous feature selection techniques which work well empirically.
There also exist randomized feature selection methods like leverage-score sampling
(Dasgupta et al., 2007) with provable guarantees which work well empirically. But the
randomized methods have a failure probability and have to be re-run multiple times to
get accurate results. Also, a randomized algorithm may not select the same features in
different runs. A deterministic algorithm will select the same features irrespective of
how many times it is run. This becomes important in many applications. Unsupervised
feature selection involves selecting features oblivious to the class or labels.

In this work, we present a new provably accurate unsupervised feature selection technique
for RLSC. We study a deterministic sampling-based feature selection strategy for
RLSC with provable non-trivial worst-case performance bounds.

We also use single-set spectral sparsification and leverage-score sampling as unsupervised
feature selection algorithms for ridge regression in the fixed design setting. Because
the methods are unsupervised, they do not depend on the response values and therefore
apply directly in the fixed design setting, where the target variables have an additive
homoskedastic noise. The algorithms sample a subset of the features from the original
data matrix and then perform the regression task on the reduced-dimension matrix. We
provide risk bounds for the feature selection algorithms on ridge regression in the fixed
design setting.

The number of features selected by both algorithms is proportional to the rank of the 
training set. The deterministic sampling-based feature selection algorithm performs 
better in practice when compared to existing methods of feature selection. 

2 Our Contributions 

We introduce single-set spectral sparsification as a provably accurate deterministic feature
selection technique for RLSC in an unsupervised setting. The number of features
selected by the algorithm is independent of the total number of features, but depends on the
number of data-points. The algorithm selects a small number of features and solves
the classification problem using those features. Dasgupta et al. (2007) used a leverage-score
based randomized feature selection technique for RLSC and provided worst-case
guarantees of the approximate classifier function to that using all features. We use
a deterministic algorithm to provide worst-case generalization error guarantees. The
deterministic algorithm does not come with a failure probability and the number of
features required by the deterministic algorithm is smaller than that required by the
randomized algorithm. The leverage-score based algorithm has a sampling complexity of
$O\left(\frac{n}{\epsilon^2} \log \frac{n}{\epsilon^2 \sqrt{\delta}}\right)$, whereas single-set spectral
sparsification requires $O(n/\epsilon^2)$ features to be picked, where $n$ is the number of
training points, $\delta \in (0,1)$ is a failure probability and $\epsilon \in (0, 1/2]$ is an
accuracy parameter. As in Dasgupta et al. (2007), we also
provide additive-error approximation guarantees for any test-point and relative-error
approximation guarantees for test-points that satisfy some conditions with respect to
the training set.

We introduce single-set spectral sparsification and leverage-score sampling as unsupervised
feature selection algorithms for ridge regression and provide risk bounds for the
subsampled problems in the fixed design setting. The risk in the sampled space is comparable
to the risk in the full-feature space. We give relative-error guarantees of the risk
for both feature selection methods in the fixed design setting.

From an empirical perspective, we evaluate single-set spectral sparsification on synthetic
data and 48 document-term matrices, which are a subset of the TechTC-300
(Davidov et al., 2004) dataset. We compare the single-set spectral sparsification algorithm
with leverage-score sampling, information gain, rank-revealing QR factorization
(RRQR) and random feature selection. We do not report running times because
feature selection is an offline task. The experimental results indicate that single-set
spectral sparsification outperforms all the methods in terms of out-of-sample error for
all 48 TechTC-300 datasets. We observe that a much smaller number of features is required
by the deterministic algorithm to achieve good performance when compared to
leverage-score sampling.

3 Background and Related Work 

3.1 Notation 

$A, B, \ldots$ denote matrices and $a, b, \ldots$ denote column vectors; $e_i$ (for all
$i = 1, \ldots, n$) is the standard basis, whose dimensionality will be clear from context;
and $I_n$ is the $n \times n$ identity matrix. The Singular Value Decomposition (SVD) of a
matrix $A \in \mathbb{R}^{n \times d}$ is equal to $A = U \Sigma V^T$, where
$U \in \mathbb{R}^{n \times d}$ is an orthogonal matrix containing the left singular vectors,
$\Sigma \in \mathbb{R}^{d \times d}$ is a diagonal matrix containing the singular values
$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d > 0$, and $V \in \mathbb{R}^{d \times d}$ is a
matrix containing the right singular vectors. The spectral norm of $A$ is
$\|A\|_2 = \sigma_1$. $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest
singular values of $A$. $\kappa_A = \sigma_{\max}/\sigma_{\min}$ is the condition number of
$A$. $U^{\perp}$ denotes any $n \times (n - d)$ orthogonal matrix whose columns span the
subspace orthogonal to $U$. A vector $q \in \mathbb{R}^n$ can be expressed as
$q = A\alpha + U^{\perp}\beta$ for some vectors $\alpha \in \mathbb{R}^d$ and
$\beta \in \mathbb{R}^{n-d}$, i.e. $q$ has one component along $A$ and another component
orthogonal to $A$.

3.2 Matrix Sampling Formalism

We now present the tools of feature selection. Let $A \in \mathbb{R}^{d \times n}$ be the data
matrix consisting of $n$ points and $d$ dimensions, and let $S \in \mathbb{R}^{r \times d}$ be
a matrix such that $SA \in \mathbb{R}^{r \times n}$ contains $r$ rows of $A$. Matrix $S$ is a
binary (0/1) indicator matrix, which has exactly one non-zero element in each row. The
non-zero element in a row of $S$ indicates which row of $A$ will be selected. Let
$D \in \mathbb{R}^{r \times r}$ be the diagonal matrix such that $DSA \in \mathbb{R}^{r \times n}$
rescales the rows of $A$ that are in $SA$. The matrices $S$ and $D$ are called the sampling and
re-scaling matrices respectively. We will replace the sampling and re-scaling matrices by a
single matrix $R \in \mathbb{R}^{r \times d}$, where $R = DS$ denotes the matrix specifying
which of the $r$ rows of $A$ are to be sampled and how they are to be rescaled.
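
The formalism above can be illustrated with a small sketch. The helper name `sampling_matrices`, the toy matrix and the chosen indices and weights are assumptions made purely for illustration; they do not come from the paper.

```python
import numpy as np

def sampling_matrices(indices, weights, d):
    """Build S (r x d, a single 1 per row) and D (r x r diagonal)."""
    r = len(indices)
    S = np.zeros((r, d))
    S[np.arange(r), indices] = 1.0      # row t of S picks row indices[t] of A
    D = np.diag(np.asarray(weights, dtype=float))
    return S, D

# Toy data matrix A with d = 5 rows (features) and n = 3 columns (points).
A = np.arange(15, dtype=float).reshape(5, 3)
S, D = sampling_matrices(indices=[4, 1], weights=[2.0, 0.5], d=5)
R = D @ S                               # combined sampling-and-rescaling matrix
RA = R @ A                              # rows 4 and 1 of A, scaled by 2.0 and 0.5
```

In practice one would index and scale the rows directly rather than form the dense matrices; the explicit product is kept here only to mirror the notation R = DS.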

3.3 RLSC Basics 

Consider a training dataset of $n$ points in $d$ dimensions with respective labels
$y_i \in \{-1, +1\}$ for $i = 1, \ldots, n$. The solution of binary classification problems via
Tikhonov regularization in a Reproducing Kernel Hilbert Space (RKHS) using the squared loss
function results in the Regularized Least Squares Classification (RLSC) problem (Rifkin et al.,
2003), which can be stated as:

$$\min_{x \in \mathbb{R}^n} \|Kx - y\|_2^2 + \lambda x^T K x \tag{1}$$

where $K$ is the $n \times n$ kernel matrix defined over the training dataset, $\lambda$ is a
regularization parameter and $y$ is the $n$-dimensional $\{\pm 1\}$ class label vector. In matrix
notation, the training dataset $X$ is a $d \times n$ matrix, consisting of $n$ data-points and
$d$ features ($d \gg n$). Throughout this study, we assume that $X$ is a full-rank matrix. We
shall consider the linear kernel, which can be written as $K = X^T X$. Using the SVD of $X$,
the optimal solution of Eqn. 1 in the full-dimensional space is

$$x_{opt} = V \left(\Sigma^2 + \lambda I\right)^{-1} V^T y. \tag{2}$$

The vector $x_{opt}$ can be used as a classification function that generalizes to test data. If
$q \in \mathbb{R}^d$ is the new test point, then the binary classification function is:

$$f(q) = x_{opt}^T X^T q. \tag{3}$$

Then, $\mathrm{sign}(f(q))$ gives the predicted label ($-1$ or $+1$) to be assigned to the new
test point $q$.
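
As a minimal sketch of Eqns. 1 to 3 with the linear kernel, the following solves for the optimal vector by a linear solve and cross-checks it against the SVD form. The function names, the random data and the value of the regularization parameter are illustrative assumptions, not taken from the paper's experiments.

```python
import numpy as np

def rlsc_fit(X, y, lam):
    """Return x_opt = (K + lam*I)^{-1} y for the linear kernel K = X^T X."""
    K = X.T @ X
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def rlsc_predict(X, x_opt, q):
    """Evaluate f(q) = x_opt^T X^T q and return (score, predicted label)."""
    score = float(x_opt @ (X.T @ q))
    return score, (1 if score >= 0 else -1)

rng = np.random.default_rng(0)
d, n, lam = 50, 8, 0.1
X = rng.standard_normal((d, n))                 # d features, n points (columns)
y = np.where(rng.random(n) < 0.5, -1.0, 1.0)    # random +/-1 labels
x_opt = rlsc_fit(X, y, lam)

# Cross-check against the SVD form x_opt = V (Sigma^2 + lam I)^{-1} V^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
x_svd = Vt.T @ ((Vt @ y) / (s**2 + lam))

score, label = rlsc_predict(X, x_opt, X[:, 0])  # classify the first training point
```

The linear solve avoids forming an explicit inverse; the SVD form is shown only because the analysis in this paper is written in terms of it.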

Our goal is to study how RLSC performs when the deterministic sampling-based
feature selection algorithm is used to select features in an unsupervised setting. Let
$R \in \mathbb{R}^{r \times d}$ be the matrix that samples and re-scales $r$ rows of $X$, thus
reducing the dimensionality of the training set from $d$ to $r \ll d$, where $r$ is proportional
to the rank of the input matrix. The transformed dataset in $r$ dimensions is given by
$\tilde{X} = RX$ and the RLSC problem becomes

$$\min_{x \in \mathbb{R}^n} \|\tilde{K}x - y\|_2^2 + \lambda x^T \tilde{K} x, \tag{4}$$

with $\tilde{K} = \tilde{X}^T \tilde{X}$, thus giving an optimal vector $\tilde{x}_{opt}$. The new
test point $q$ is first dimensionally reduced to $\tilde{q} = Rq$, where
$\tilde{q} \in \mathbb{R}^r$, and then classified by the function

$$\tilde{f} = f(\tilde{q}) = \tilde{x}_{opt}^T \tilde{X}^T \tilde{q}. \tag{5}$$

In subsequent sections, we will assume that the test-point $q$ is of the form
$q = X\alpha + U^{\perp}\beta$. The first part of the expression shows the portion of the
test-point that is similar to the training-set and the second part shows how much the test-point
is novel compared to the training set, i.e. $\|\beta\|_2$ measures how much of $q$ lies outside
the subspace spanned by the training set.

3.4 Ridge Regression Basics

Consider a dataset $X$ of $n$ points in $d$ dimensions with $d \gg n$. Here $X$ contains $n$
i.i.d. samples from the $d$-dimensional independent variable, and $y \in \mathbb{R}^n$ is the
real-valued response vector. Ridge Regression (RR) or Tikhonov regularization penalizes the
$\ell_2$ norm of a parameter vector $\beta$ and shrinks the estimated coefficients towards zero.
In the fixed design setting, we have $y = X^T\beta + \omega$, where
$\omega \in \mathbb{R}^n$ is the homoskedastic noise vector with mean 0 and variance
$\sigma^2$. Let $\hat{\beta}_\lambda$ be the solution to the ridge regression problem. The RR
problem is stated as:

$$\hat{\beta}_\lambda = \arg\min_{\beta \in \mathbb{R}^d} \frac{1}{n} \|y - X^T\beta\|_2^2 + \lambda \|\beta\|_2^2. \tag{6}$$

The solution to Eqn. 6 is $\hat{\beta}_\lambda = \left(XX^T + n\lambda I_d\right)^{-1} X y$. One
can also solve the same problem in the dual space. Using the change of variables
$\beta = X\alpha$, where $\alpha \in \mathbb{R}^n$, and letting $K = X^T X$ be the
$n \times n$ linear kernel defined over the training dataset, the optimization problem becomes:

$$\hat{\alpha}_\lambda = \arg\min_{\alpha \in \mathbb{R}^n} \frac{1}{n} \|y - K\alpha\|_2^2 + \lambda \alpha^T K \alpha. \tag{7}$$

Throughout this study, we assume that $X$ is a full-rank matrix. Using the SVD of $X$,
the optimal solution in the dual space (Eqn. 7) for the full-dimensional data is given by
$\hat{\alpha}_\lambda = (K + n\lambda I_n)^{-1} y$. The primal solution is
$\hat{\beta}_\lambda = X\hat{\alpha}_\lambda$.
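
The agreement between the primal and dual solutions can be checked numerically. The sketch below uses random data and an arbitrary regularization value, both chosen here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lam = 40, 10, 0.5
X = rng.standard_normal((d, n))            # d features, n points (columns)
y = rng.standard_normal(n)                 # real-valued responses

# Primal: beta = (X X^T + n*lam*I_d)^{-1} X y, a d x d system.
beta_primal = np.linalg.solve(X @ X.T + n * lam * np.eye(d), X @ y)

# Dual: alpha = (K + n*lam*I_n)^{-1} y with K = X^T X, an n x n system.
K = X.T @ X
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
beta_dual = X @ alpha                      # map the dual solution back to the primal
```

When d is much larger than n, the dual system is the cheaper of the two to solve, which is why the analysis works with the kernel matrix.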

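In the sampled space, we have $\tilde{K} = \tilde{X}^T \tilde{X}$. The dual problem in the
sampled space can be posed as:

$$\tilde{\alpha}_\lambda = \arg\min_{\alpha \in \mathbb{R}^n} \frac{1}{n} \|y - \tilde{K}\alpha\|_2^2 + \lambda \alpha^T \tilde{K} \alpha. \tag{8}$$

The optimal dual solution in the sampled space is
$\tilde{\alpha}_\lambda = \left(\tilde{K} + n\lambda I_n\right)^{-1} y$. The primal solution is
$\tilde{\beta}_\lambda = \tilde{X}\tilde{\alpha}_\lambda$.
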
3.5 Related Work

The work most closely related to ours is that of Dasgupta et al. (2007), who used a
leverage-score based randomized feature selection technique for RLSC and provided
worst-case bounds of the approximate classifier with that of the classifier for all features.
The proof of their main quality-of-approximation results provided an intuition
of the circumstances when their feature selection method will work well. The running
time of leverage-score based sampling is dominated by the time to compute the SVD of the
training set, i.e. $O(n^2 d)$, whereas for single-set spectral sparsification it is $O(r d n^2)$.
Single-set spectral sparsification is a slower and more accurate method than leverage-score
sampling. Another work on dimensionality reduction of RLSC is that of Avron
et al. (2013), who used efficient randomized algorithms for solving RLSC in settings
where the design matrix has a Vandermonde structure. However, this technique is different
from ours, since their work is focused on dimensionality reduction using linear
combinations of features, but not on actual feature selection.

Lu et al. (2013) used the Randomized Walsh-Hadamard transform to lower the dimension
of the data matrix and subsequently solve the ridge regression problem in the lower-dimensional
space. They provided risk bounds of their algorithm in the fixed design setting.
However, this is different from our work, since they use linear combinations of features,
while we select actual features from the data.

4 Our Main Tools

4.1 Single-set Spectral Sparsification

We describe the Single-Set Spectral Sparsification algorithm (BSS¹ for short) of Batson
et al. (2009) as Algorithm 1. Algorithm 1 is a greedy technique that selects
columns one at a time. Consider the input matrix as a set of $d$ column vectors
$U^T = [u_1, u_2, \ldots, u_d]$, with $u_i \in \mathbb{R}^{\ell}$ ($i = 1, \ldots, d$). Given
$\ell$ and $r > \ell$, we iterate over $\tau = 0, 1, 2, \ldots, r - 1$. Define the parameters
$L_\tau = \tau - \sqrt{r\ell}$, $\delta_L = 1$, $U_\tau = \delta_U \left(\tau + \sqrt{\ell r}\right)$ and
$\delta_U = \left(1 + \sqrt{\ell/r}\right) / \left(1 - \sqrt{\ell/r}\right)$. For
$U, L \in \mathbb{R}$ and $A \in \mathbb{R}^{\ell \times \ell}$ a symmetric positive
definite matrix with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_\ell$, define

$$\phi(L, A) = \sum_{i=1}^{\ell} \frac{1}{\lambda_i - L}; \qquad \hat{\phi}(U, A) = \sum_{i=1}^{\ell} \frac{1}{U - \lambda_i}$$

as the lower and upper potentials respectively. These potential functions measure how
far the eigenvalues of $A$ are from the upper and lower barriers $U$ and $L$ respectively.
We define $\mathcal{L}(u, \delta_L, A, L)$ and $\mathcal{U}(u, \delta_U, A, U)$ as follows:

$$\mathcal{L}(u, \delta_L, A, L) = \frac{u^T \left(A - (L + \delta_L) I_\ell\right)^{-2} u}{\phi(L + \delta_L, A) - \phi(L, A)} - u^T \left(A - (L + \delta_L) I_\ell\right)^{-1} u,$$

$$\mathcal{U}(u, \delta_U, A, U) = \frac{u^T \left((U + \delta_U) I_\ell - A\right)^{-2} u}{\hat{\phi}(U, A) - \hat{\phi}(U + \delta_U, A)} + u^T \left((U + \delta_U) I_\ell - A\right)^{-1} u.$$

At every iteration, there exists an index $i_\tau$ and a weight $t_\tau > 0$ such that
$t_\tau^{-1} \le \mathcal{L}(u_{i_\tau}, \delta_L, A, L)$ and
$t_\tau^{-1} \ge \mathcal{U}(u_{i_\tau}, \delta_U, A, U)$. Thus, there will be at most $r$ columns
selected after $r$ iterations. The running time of the algorithm is dominated by the search
for an index $i_\tau$ satisfying

$$\mathcal{U}(u_{i_\tau}, \delta_U, A_\tau, U_\tau) \le \mathcal{L}(u_{i_\tau}, \delta_L, A_\tau, L_\tau)$$

and computing the weight $t_\tau$. One needs to compute the upper and lower potentials
$\hat{\phi}(U, A)$ and $\phi(L, A)$ and hence the eigenvalues of $A$. The cost per iteration is
$O(\ell^3)$ and the total cost is $O(r\ell^3)$. For $i = 1, \ldots, d$, we need to compute
$\mathcal{L}$ and $\mathcal{U}$ for every $u_i$, which can be done in $O(d\ell^2)$ for every
iteration, for a total of $O(rd\ell^2)$. Thus the total running time of the algorithm is
$O(rd\ell^2)$. We present the following lemma for the single-set spectral sparsification
algorithm.

¹ The name BSS comes from the authors Batson, Spielman and Srivastava.


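Input: $U^T = [u_1, u_2, \ldots, u_d] \in \mathbb{R}^{\ell \times d}$ with $u_i \in \mathbb{R}^{\ell}$ and $r > \ell$.

Output: Matrices $S \in \mathbb{R}^{r \times d}$, $D \in \mathbb{R}^{r \times r}$.

1. Initialize $A_0 = 0_{\ell \times \ell}$, $S = 0_{r \times d}$, $D = 0_{r \times r}$.

2. Set constants $\delta_L = 1$ and $\delta_U = \left(1 + \sqrt{\ell/r}\right) / \left(1 - \sqrt{\ell/r}\right)$.

3. for $\tau = 0$ to $r - 1$ do

   • Let $L_\tau = \tau - \sqrt{r\ell}$; $U_\tau = \delta_U \left(\tau + \sqrt{\ell r}\right)$.

   • Pick index $i \in \{1, 2, \ldots, d\}$ and number $t_\tau > 0$, such that
     $\mathcal{U}(u_i, \delta_U, A_\tau, U_\tau) \le t_\tau^{-1} \le \mathcal{L}(u_i, \delta_L, A_\tau, L_\tau)$.

   • Update $A_{\tau+1} = A_\tau + t_\tau u_i u_i^T$; set $S_{\tau+1, i} = 1$ and $D_{\tau+1, \tau+1} = 1/\sqrt{t_\tau}$.

4. end for

5. Multiply all the weights in $D$ by $\sqrt{r^{-1}\left(1 - \sqrt{\ell/r}\right)}$.

6. Return $S$ and $D$.

Algorithm 1: Single-set Spectral Sparsification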
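
The greedy loop of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the implementation used in the experiments: the function name `bss_select` is hypothetical, the index is chosen as the one maximizing the gap between the lower and upper functions, the weight is taken at the midpoint of the admissible interval for 1/t, and the function returns the selected indices, the weights t and the rescaled matrix A instead of building S and D explicitly.

```python
import numpy as np

def bss_select(U, r):
    """Greedy selection over the rows u_i of U (d x l, with U^T U = I_l)."""
    d, l = U.shape
    assert r > l
    delta_L = 1.0
    delta_U = (1 + np.sqrt(l / r)) / (1 - np.sqrt(l / r))
    A = np.zeros((l, l))
    I = np.eye(l)
    idx, t_vals = [], []

    def quad(M):
        # u_i^T M u_i for all d candidate vectors at once
        return np.einsum("ij,jk,ik->i", U, M, U)

    for tau in range(r):
        L_tau = tau - np.sqrt(r * l)
        U_tau = delta_U * (tau + np.sqrt(l * r))
        lam = np.linalg.eigvalsh(A)
        # Change of the lower / upper potential when the barrier is shifted.
        dphi_L = np.sum(1 / (lam - (L_tau + delta_L))) - np.sum(1 / (lam - L_tau))
        dphi_U = np.sum(1 / (U_tau - lam)) - np.sum(1 / (U_tau + delta_U - lam))
        M_L = np.linalg.inv(A - (L_tau + delta_L) * I)
        M_U = np.linalg.inv((U_tau + delta_U) * I - A)
        low = quad(M_L @ M_L) / dphi_L - quad(M_L)   # lower function, per candidate
        up = quad(M_U @ M_U) / dphi_U + quad(M_U)    # upper function, per candidate
        i = int(np.argmax(low - up))                 # a candidate with up <= low
        t = 2.0 / (low[i] + up[i])                   # 1/t is the midpoint of [up, low]
        A += t * np.outer(U[i], U[i])
        idx.append(i)
        t_vals.append(t)

    scale = (1 - np.sqrt(l / r)) / r                 # squared form of the step-5 factor
    return np.array(idx), np.array(t_vals), scale * A

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((60, 4)))    # orthonormal columns, d=60, l=4
idx, t_vals, A_final = bss_select(U, r=16)
eigs = np.linalg.eigvalsh(A_final)                   # should lie between the two barriers
```

With l = 4 and r = 16 the barriers after rescaling are (1 - 1/2)^2 and (1 + 1/2)^2, so the eigenvalues of the returned matrix are expected to lie in [0.25, 2.25].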
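Lemma 1. BSS (Batson et al., 2009): Given $U \in \mathbb{R}^{d \times \ell}$ satisfying
$U^T U = I_\ell$ and $r > \ell$, we can deterministically construct sampling and rescaling
matrices $S \in \mathbb{R}^{r \times d}$ and $D \in \mathbb{R}^{r \times r}$ with $R = DS$,
such that, for all $y \in \mathbb{R}^{\ell}$:

$$\left(1 - \sqrt{\ell/r}\right)^2 \|Uy\|_2^2 \le \|RUy\|_2^2 \le \left(1 + \sqrt{\ell/r}\right)^2 \|Uy\|_2^2.$$

We now present a slightly modified version of Lemma 1 for our theorems.

Lemma 2. Given $U \in \mathbb{R}^{d \times \ell}$ satisfying $U^T U = I_\ell$ and $r > \ell$, we
can deterministically construct sampling and rescaling matrices
$S \in \mathbb{R}^{r \times d}$ and $D \in \mathbb{R}^{r \times r}$ such that for $R = DS$,

$$\left\|U^T U - U^T R^T R U\right\|_2 \le 3\sqrt{\ell/r}.$$

Proof. From Lemma 1, it follows that

$$\sigma_\ell\left(U^T R^T R U\right) \ge \left(1 - \sqrt{\ell/r}\right)^2 \quad \text{and} \quad \sigma_1\left(U^T R^T R U\right) \le \left(1 + \sqrt{\ell/r}\right)^2.$$

Thus,

$$\lambda_{\max}\left(U^T U - U^T R^T R U\right) \le 1 - \left(1 - \sqrt{\ell/r}\right)^2 \le 2\sqrt{\ell/r}.$$

Similarly,

$$\lambda_{\min}\left(U^T U - U^T R^T R U\right) \ge 1 - \left(1 + \sqrt{\ell/r}\right)^2 \ge -3\sqrt{\ell/r}.$$

Combining these, we have $\left\|U^T U - U^T R^T R U\right\|_2 \le 3\sqrt{\ell/r}$.

Note: Let $\epsilon = 3\sqrt{\ell/r}$. It is possible to set an upper bound on $\epsilon$ by
setting the value of $r$. We will assume $\epsilon \in (0, 1/2]$. □
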
4.2 Leverage Score Sampling

Our randomized feature selection method is based on importance sampling or the so-called
leverage-score sampling of Rudelson and Vershynin (2007). Let $U$ be the top-$\rho$
left singular vectors of the training set $X$, where $\rho$ is the rank of $X$. We construct a
carefully chosen probability distribution of the form

$$p_i = \frac{\|U_i\|_2^2}{n}, \quad \text{for } i = 1, 2, \ldots, d, \tag{9}$$

i.e. proportional to the squared Euclidean norms of the rows of the left singular vectors.
We then select $r$ rows of $U$ in i.i.d. trials and re-scale the rows with
$1/\sqrt{\tilde{p}_i}$, where $\tilde{p}_i = \min\{1, r p_i\}$. The time
complexity is dominated by the time to compute the SVD of $X$.

Lemma 3. (Rudelson and Vershynin, 2007) Let $\epsilon \in (0, 1/2]$ be an accuracy parameter
and $\delta \in (0, 1)$ be the failure probability. Given $U \in \mathbb{R}^{d \times \ell}$
satisfying $U^T U = I_\ell$, let $\tilde{p}_i = \min\{1, r p_i\}$, let $p_i$ be as in Eqn. 9 and
let $r = O\left(\frac{n}{\epsilon^2} \log \frac{n}{\epsilon^2 \sqrt{\delta}}\right)$. Construct
the sampling and rescaling matrix $R$. Then with probability at least $(1 - \delta)$,

$$\left\|U^T U - U^T R^T R U\right\|_2 \le \epsilon.$$
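
A minimal sketch of leverage-score sampling follows. It makes one assumption that should be read as a convention chosen here, not as the paper's exact procedure: rows are drawn i.i.d. with replacement from the leverage-score distribution and a drawn row i is rescaled by 1/sqrt(r p_i). The function names and the random test matrix are likewise illustrative.

```python
import numpy as np

def leverage_score_probs(X):
    """Return U and p_i = ||U_i||_2^2 / n, U being the left singular vectors of X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U, np.sum(U**2, axis=1) / U.shape[1]

def sample_rows(p, r, rng):
    """Draw r row indices i.i.d. from p; row t of R holds 1/sqrt(r * p_i)."""
    idx = rng.choice(len(p), size=r, replace=True, p=p)
    R = np.zeros((r, len(p)))
    R[np.arange(r), idx] = 1.0 / np.sqrt(r * p[idx])
    return idx, R

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))        # d = 200 features, n = 5 points
U, p = leverage_score_probs(X)
idx, R = sample_rows(p, r=100, rng=rng)

# Spectral-norm error of the sampled Gram matrix, the quantity bounded in Lemma 3.
err = np.linalg.norm(U.T @ U - U.T @ R.T @ R @ U, 2)
```

Because the sample is random, the error changes with the seed; repeating the draw several times, as done in the experiments, averages out this variability.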

5 Theory 

In this section we describe the theoretical guarantees of RLSC using BSS and also
the risk bounds of ridge regression using BSS and leverage-score sampling. Before
we begin, we state the following lemmas from numerical linear algebra which will be
required for our proofs.

Lemma 4. (Stewart and Sun, 1990) For any matrix $E$ such that $I + E$ is invertible,

$$(I + E)^{-1} = I + \sum_{i=1}^{\infty} (-E)^i.$$

Lemma 5. (Stewart and Sun, 1990) Let $A$ and $\tilde{A} = A + E$ be invertible matrices. Then

$$\tilde{A}^{-1} - A^{-1} = -A^{-1} E \tilde{A}^{-1}.$$

Lemma 6. (Demmel and Veselic, 1992) Let $D$ and $X$ be matrices such that the product
$DXD$ is a symmetric positive definite matrix with $X_{ii} = 1$. Let the product
$DED$ be a perturbation such that $\|E\|_2 = \eta < \lambda_{\min}(X)$. Here
$\lambda_{\min}$ corresponds to the smallest eigenvalue of $X$. Let $\lambda_i$ be the $i$-th
eigenvalue of $DXD$ and let $\tilde{\lambda}_i$ be the $i$-th eigenvalue of $D(X + E)D$. Then,

$$\left|\frac{\lambda_i - \tilde{\lambda}_i}{\lambda_i}\right| \le \frac{\eta}{\lambda_{\min}(X)}.$$

Lemma 7. Let $\epsilon \in (0, 1/2]$. Then
$\left\|q^T U^{\perp} U^{\perp T} R^T R U\right\|_2 \le \epsilon \left\|U^{\perp} U^{\perp T} q\right\|_2$.

The proof of this lemma is similar to Lemma 4.3 of Drineas et al. (2006).



5.1 Our Main Theorems on RLSC

The following theorem shows the additive error guarantees of the generalization bounds
of the approximate classifier with that of the classifier with no feature selection. The
classification error bound of BSS on RLSC depends on the condition number of the
training set and on how much of the test-set lies in the subspace of the training set.

Theorem 1. Let $\epsilon \in (0, 1/2]$ be an accuracy parameter and $r = O(n/\epsilon^2)$ be
the number of features selected by BSS. Let $R \in \mathbb{R}^{r \times d}$ be the matrix, as
defined in Lemma 2. Let $X \in \mathbb{R}^{d \times n}$ with $d \gg n$ be the training set,
$\tilde{X} = RX$ the reduced-dimensional matrix and $q \in \mathbb{R}^d$ the test point of
the form $q = X\alpha + U^{\perp}\beta$. Then, the following hold:

• If $\lambda = 0$, then
  $$\left|\tilde{q}^T \tilde{X} \tilde{x}_{opt} - q^T X x_{opt}\right| \le \frac{\epsilon \kappa_X}{\sigma_{\max}} \|\beta\|_2 \|y\|_2.$$

• If $\lambda > 0$, then
  $$\left|\tilde{q}^T \tilde{X} \tilde{x}_{opt} - q^T X x_{opt}\right| \le 2\epsilon\kappa_X \|\alpha\|_2 \|y\|_2 + \frac{2\epsilon\kappa_X}{\sigma_{\max}} \|\beta\|_2 \|y\|_2.$$


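Proof. We assume that $X$ is a full-rank matrix. Let $E = U^T R^T R U - U^T U$, so that
$\|E\|_2 = \left\|I - U^T R^T R U\right\|_2 = \epsilon \le 1/2$. Using the SVD of $X$, we define

$$\Delta = \Sigma U^T R^T R U \Sigma = \Sigma (I + E) \Sigma. \tag{10}$$

The optimal solution in the sampled space is given by

$$\tilde{x}_{opt} = V (\Delta + \lambda I)^{-1} V^T y. \tag{11}$$

It can be proven easily that $\Delta$ and $\Delta + \lambda I$ are invertible matrices. We focus
on the term $q^T X x_{opt}$. Using the SVD of $X$, we get

$$q^T X x_{opt} = \alpha^T X^T X x_{opt} + \beta^T U^{\perp T} \left(U \Sigma V^T\right) x_{opt}$$
$$= \alpha^T V \Sigma^2 \left(\Sigma^2 + \lambda I\right)^{-1} V^T y \tag{12}$$
$$= \alpha^T V \left(I + \lambda \Sigma^{-2}\right)^{-1} V^T y. \tag{13}$$

Eqn. (12) follows because of the fact $U^{\perp T} U = 0$ and by substituting $x_{opt}$ from
Eqn. (2). Eqn. (13) follows from the fact that the matrices $\Sigma^2$ and
$\Sigma^2 + \lambda I$ are invertible. Now,

$$\left|q^T X x_{opt} - \tilde{q}^T \tilde{X} \tilde{x}_{opt}\right| = \left|q^T X x_{opt} - q^T R^T R X \tilde{x}_{opt}\right|$$
$$\le \left|q^T X x_{opt} - \alpha^T X^T R^T R X \tilde{x}_{opt}\right| \tag{14}$$
$$+ \left|\beta^T U^{\perp T} R^T R X \tilde{x}_{opt}\right|. \tag{15}$$
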
(15) 
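We bound (14) and (15) separately. Substituting the values of $\tilde{x}_{opt}$ and $\Delta$,

$$\alpha^T X^T R^T R X \tilde{x}_{opt} = \alpha^T V \Delta V^T \tilde{x}_{opt}$$
$$= \alpha^T V \Delta (\Delta + \lambda I)^{-1} V^T y$$
$$= \alpha^T V \left(I + \lambda \Delta^{-1}\right)^{-1} V^T y$$
$$= \alpha^T V \left(I + \lambda \Sigma^{-1} (I + E)^{-1} \Sigma^{-1}\right)^{-1} V^T y$$
$$= \alpha^T V \left(I + \lambda \Sigma^{-2} + \lambda \Sigma^{-1} \Phi \Sigma^{-1}\right)^{-1} V^T y. \tag{16}$$

The last line follows from Lemma 4, which states that $(I + E)^{-1} = I + \Phi$,
where $\Phi = \sum_{i=1}^{\infty} (-E)^i$. The spectral norm of $\Phi$ is bounded by

$$\|\Phi\|_2 = \left\|\sum_{i=1}^{\infty} (-E)^i\right\|_2 \le \sum_{i=1}^{\infty} \|E\|_2^i \le \sum_{i=1}^{\infty} \epsilon^i = \epsilon/(1 - \epsilon). \tag{17}$$

We now bound (14). Substituting (13) and (16) in (14),

$$\left|q^T X x_{opt} - \alpha^T X^T R^T R X \tilde{x}_{opt}\right|$$
$$= \left|\alpha^T V \left\{\left(I + \lambda \Sigma^{-2} + \lambda \Sigma^{-1} \Phi \Sigma^{-1}\right)^{-1} - \left(I + \lambda \Sigma^{-2}\right)^{-1}\right\} V^T y\right|$$
$$\le \left\|\alpha^T V \left(I + \lambda \Sigma^{-2}\right)^{-1}\right\|_2 \left\|V^T y\right\|_2 \|\Psi\|_2.$$

The last line follows because of Lemma 5 and the fact that all matrices involved are
invertible. Here,

$$\Psi = \lambda \Sigma^{-1} \Phi \Sigma^{-1} \left(I + \lambda \Sigma^{-2} + \lambda \Sigma^{-1} \Phi \Sigma^{-1}\right)^{-1}$$
$$= \lambda \Sigma^{-1} \Phi \Sigma^{-1} \left(\Sigma^{-1} \left(\Sigma^2 + \lambda I + \lambda \Phi\right) \Sigma^{-1}\right)^{-1}$$
$$= \lambda \Sigma^{-1} \Phi \left(\Sigma^2 + \lambda I + \lambda \Phi\right)^{-1} \Sigma.$$

Since the spectral norms of $\Sigma$, $\Sigma^{-1}$ and $\Phi$ are bounded, we only need to
bound the spectral norm of $\left(\Sigma^2 + \lambda I + \lambda \Phi\right)^{-1}$ to bound the
spectral norm of $\Psi$. The spectral norm of the matrix
$\left(\Sigma^2 + \lambda I + \lambda \Phi\right)^{-1}$ is the inverse of the smallest singular
value of $\left(\Sigma^2 + \lambda I + \lambda \Phi\right)$. From perturbation theory of matrices
(Stewart and Sun, 1990) and (17), we get

$$\left|\sigma_i\left(\Sigma^2 + \lambda I + \lambda \Phi\right) - \sigma_i\left(\Sigma^2 + \lambda I\right)\right| \le \|\lambda \Phi\|_2 \le \epsilon\lambda.$$

Here, $\sigma_i(Q)$ represents the $i$-th singular value of the matrix $Q$.

Also, $\sigma_i\left(\Sigma^2 + \lambda I\right) = \sigma_i^2 + \lambda$, where $\sigma_i$ are
the singular values of $X$. Hence

$$\sigma_i^2 + (1 - \epsilon)\lambda \le \sigma_i\left(\Sigma^2 + \lambda I + \lambda \Phi\right) \le \sigma_i^2 + (1 + \epsilon)\lambda.$$

Thus,

$$\left\|\left(\Sigma^2 + \lambda I + \lambda \Phi\right)^{-1}\right\|_2 = 1/\sigma_{\min}\left(\Sigma^2 + \lambda I + \lambda \Phi\right) \le 1/\left(\sigma_{\min}^2 + (1 - \epsilon)\lambda\right).$$

Here, $\sigma_{\max}$ and $\sigma_{\min}$ denote the largest and smallest singular value of
$X$. Since $\|\Sigma\|_2 \left\|\Sigma^{-1}\right\|_2 = \sigma_{\max}/\sigma_{\min} \le \kappa_X$
(the condition number of $X$), we bound (14):

$$\left|q^T X x_{opt} - \alpha^T X^T R^T R X \tilde{x}_{opt}\right| \le \frac{\epsilon \lambda \kappa_X}{\sigma_{\min}^2 + (1 - \epsilon)\lambda} \left\|\alpha^T V \left(I + \lambda \Sigma^{-2}\right)^{-1}\right\|_2 \left\|V^T y\right\|_2. \tag{18}$$

For $\lambda > 0$, the term $\sigma_{\min}^2 + (1 - \epsilon)\lambda$ in Eqn. (18) is always larger
than $(1 - \epsilon)\lambda$, so the leading factor can be upper bounded by $2\epsilon\kappa_X$
(assuming $\epsilon \le 1/2$). Also,

$$\left\|\alpha^T V \left(I + \lambda \Sigma^{-2}\right)^{-1}\right\|_2 \le \left\|\alpha^T V\right\|_2 \left\|\left(I + \lambda \Sigma^{-2}\right)^{-1}\right\|_2 \le \|\alpha\|_2.$$

This follows from the fact that $\left\|\alpha^T V\right\|_2 = \|\alpha\|_2$ and
$\left\|V^T y\right\|_2 = \|y\|_2$, as $V$ is a full-rank orthonormal matrix, and the singular
values of $I + \lambda \Sigma^{-2}$ are equal to $1 + \lambda/\sigma_i^2$, making the spectral
norm of its inverse at most one. Thus we get

$$\left|q^T X x_{opt} - \alpha^T X^T R^T R X \tilde{x}_{opt}\right| \le 2\epsilon\kappa_X \|\alpha\|_2 \|y\|_2. \tag{19}$$
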
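We now bound (15). Expanding (15) using the SVD and $\tilde{x}_{opt}$,

$$\left|\beta^T U^{\perp T} R^T R X \tilde{x}_{opt}\right| = \left|\beta^T U^{\perp T} R^T R U \Sigma (\Delta + \lambda I)^{-1} V^T y\right|$$
$$\le \left\|q^T U^{\perp} U^{\perp T} R^T R U\right\|_2 \left\|\Sigma (\Delta + \lambda I)^{-1}\right\|_2 \left\|V^T y\right\|_2$$
$$\le \epsilon \left\|U^{\perp} U^{\perp T} q\right\|_2 \left\|V^T y\right\|_2 \left\|\Sigma (\Delta + \lambda I)^{-1}\right\|_2$$
$$\le \epsilon \|\beta\|_2 \|y\|_2 \left\|\Sigma (\Delta + \lambda I)^{-1}\right\|_2.$$

The first inequality follows from $\beta = U^{\perp T} q$ and the second inequality follows from
Lemma 7. To conclude the proof, we bound the spectral norm of
$\Sigma (\Delta + \lambda I)^{-1}$. Note that from Eqn. (10),
$\Sigma^{-1} \Delta \Sigma^{-1} = I + E$ and $\Sigma \Sigma^{-1} = I$, so

$$\Sigma (\Delta + \lambda I)^{-1} = \left(\Sigma^{-1} \Delta \Sigma^{-1} + \lambda \Sigma^{-2}\right)^{-1} \Sigma^{-1} = \left(I + \lambda \Sigma^{-2} + E\right)^{-1} \Sigma^{-1}.$$

One can get a lower bound for the smallest singular value of
$\left(I + \lambda \Sigma^{-2} + E\right)$ using matrix perturbation theory and by comparing
the singular values of this matrix to the singular values of $I + \lambda \Sigma^{-2}$. We get

$$(1 - \epsilon) + \frac{\lambda}{\sigma_i^2} \le \sigma_i\left(I + E + \lambda \Sigma^{-2}\right) \le (1 + \epsilon) + \frac{\lambda}{\sigma_i^2},$$

$$\left\|\left(I + \lambda \Sigma^{-2} + E\right)^{-1} \Sigma^{-1}\right\|_2 \le \frac{\sigma_{\max}^2}{\left((1 - \epsilon)\sigma_{\max}^2 + \lambda\right)\sigma_{\min}} = \frac{\kappa_X \sigma_{\max}}{(1 - \epsilon)\sigma_{\max}^2 + \lambda} \le \frac{2\kappa_X}{\sigma_{\max}}. \tag{20}$$

We assumed that $\epsilon \le 1/2$, which implies
$(1 - \epsilon) + \lambda/\sigma_{\max}^2 \ge 1/2$. Combining these, we get

$$\left|\beta^T U^{\perp T} R^T R X \tilde{x}_{opt}\right| \le \frac{2\epsilon\kappa_X}{\sigma_{\max}} \|\beta\|_2 \|y\|_2. \tag{21}$$

Combining Eqns. (19) and (21) we complete the proof for the case $\lambda > 0$. For
$\lambda = 0$, Eqn. (18) becomes zero and the result follows. □
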
Our next theorem provides relative-error guarantees to the bound on the classification
error when the test-point has no new components, i.e. $\beta = 0$.

Theorem 2. Let $\epsilon \in (0, 1/2]$ be an accuracy parameter, $r = O(n/\epsilon^2)$ be the
number of features selected by BSS and $\lambda > 0$. Let $q \in \mathbb{R}^d$ be the test point
of the form $q = X\alpha$, i.e. it lies entirely in the subspace spanned by the training set, and
let the two vectors $V^T y$ and $\left(I + \lambda \Sigma^{-2}\right)^{-1} V^T \alpha$ satisfy
the property

$$\left\|\left(I + \lambda \Sigma^{-2}\right)^{-1} V^T \alpha\right\|_2 \left\|V^T y\right\|_2 \le \omega \left|\left(\left(I + \lambda \Sigma^{-2}\right)^{-1} V^T \alpha\right)^T V^T y\right| = \omega \left|q^T X x_{opt}\right|$$

for some constant $\omega$. If we run RLSC after BSS, then

$$\left|\tilde{q}^T \tilde{X} \tilde{x}_{opt} - q^T X x_{opt}\right| \le 2\epsilon\omega\kappa_X \left|q^T X x_{opt}\right|.$$

The proof follows directly from the proof of Theorem 1 if we consider $\beta = 0$.

5.2 Our Main Theorems on Ridge Regression 

We compare the risk of subsampled ridge regression with the risk of true dual ridge
regression in the fixed design setting. Recall that the response vector is
$y = X^T\beta + \omega$, where $\omega \in \mathbb{R}^n$ is the homoskedastic noise vector
with mean 0 and variance $\sigma^2$. Also, we assume that the data matrix is of full rank.

Lemma 8. Let $\rho$ be the rank of $X$. Form $\tilde{K}$ using BSS. Then,

$$(1 - \Delta) K \preceq \tilde{K} \preceq (1 + \Delta) K,$$

where $\Delta = C\sqrt{\rho/r}$. For p.s.d. matrices, $A \preceq B$ means $B - A$ is p.s.d.

Proof. Using the SVD of $X$, $\tilde{K} = V \Sigma \left(U^T R^T R U\right) \Sigma V^T$. Lemma 2 implies

$$I_\rho (1 - \Delta) \preceq U^T R^T R U \preceq I_\rho (1 + \Delta).$$

Multiplying the above inequality on the left by $V\Sigma$ and on the right by $\Sigma V^T$
completes the proof. □

Lemma 9. Let $\rho$ be the rank of $X$. Form $\tilde{K}$ using leverage-score sampling. Then, with
probability at least $(1 - \delta)$, where $\delta \in (0, 1)$,

$$(1 - \Delta) K \preceq \tilde{K} \preceq (1 + \Delta) K,$$

where $\Delta = C\sqrt{\frac{\rho}{r} \log \frac{2\rho}{\delta}}$.


5.2.1 Risk Function for Ridge Regression

Let $z = E_\omega[y] = X^T\beta$. The risk for a prediction function
$\hat{y} \in \mathbb{R}^n$ is $\frac{1}{n} E_\omega \|\hat{y} - z\|_2^2$.
For any $n \times n$ positive symmetric matrix $K$, we define the following risk function:

$$R(K) = \frac{\sigma^2}{n} \mathrm{Tr}\left(K^2 (K + n\lambda I_n)^{-2}\right) + n\lambda^2 z^T (K + n\lambda I_n)^{-2} z.$$

Theorem 3. Under the fixed design setting, the risk for the ridge regression solution
in the full-feature space is $R(K)$ and the risk for the ridge regression in the reduced
dimensional space is $R(\tilde{K})$.


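Proof. The risk of the ridge regression estimator in the reduced dimensional space is

$$\frac{1}{n} E_\omega \left\|\tilde{K}\tilde{\alpha}_\lambda - z\right\|_2^2 = \frac{1}{n} E_\omega \left\|\tilde{K}\left(\tilde{K} + n\lambda I_n\right)^{-1} y - z\right\|_2^2. \tag{22}$$

Taking $\tilde{K}\left(\tilde{K} + n\lambda I_n\right)^{-1}$ as $Q$, we can write Eqn. (22) as

$$\frac{1}{n} E_\omega \left\|Qy - E_\omega[Qy]\right\|_2^2 + \frac{1}{n} \left\|E_\omega[Qy] - z\right\|_2^2$$
$$= \frac{1}{n} E_\omega \left\|\tilde{K}\left(\tilde{K} + n\lambda I_n\right)^{-1} \omega\right\|_2^2 + \frac{1}{n} \left\|\tilde{K}\left(\tilde{K} + n\lambda I_n\right)^{-1} z - z\right\|_2^2$$
$$= \frac{1}{n} \mathrm{Tr}\left(\tilde{K}^2 \left(\tilde{K} + n\lambda I_n\right)^{-2} E_\omega\left[\omega\omega^T\right]\right) + \frac{1}{n} z^T \left(n^2\lambda^2 \left(\tilde{K} + n\lambda I_n\right)^{-2}\right) z$$
$$= \frac{\sigma^2}{n} \mathrm{Tr}\left(\tilde{K}^2 \left(\tilde{K} + n\lambda I_n\right)^{-2}\right) + n\lambda^2 z^T \left(\tilde{K} + n\lambda I_n\right)^{-2} z.$$

The expectation is only over the random noise $\omega$ and is conditional on the feature
selection method used. □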
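
The closed form of the risk can be checked numerically against its definition. In the sketch below the function name `risk`, the random data and the values of the regularization and noise parameters are assumptions chosen for illustration; the bias and variance terms are the two summands of the risk function defined above.

```python
import numpy as np

def risk(K, z, lam, sigma2):
    """Return (bias, variance) of the risk for an n x n p.s.d. kernel matrix K."""
    n = K.shape[0]
    M = np.linalg.inv(K + n * lam * np.eye(n))
    bias = n * lam**2 * (z @ M @ M @ z)
    variance = sigma2 / n * np.trace(K @ K @ M @ M)
    return bias, variance

rng = np.random.default_rng(2)
d, n, lam, sigma2 = 30, 6, 0.2, 0.5
X = rng.standard_normal((d, n))
beta = rng.standard_normal(d)
z = X.T @ beta                           # noiseless response, z = X^T beta
K = X.T @ X
bias, variance = risk(K, z, lam, sigma2)

# Direct evaluation from the definition, with Q = K (K + n*lam*I)^{-1}.
Q = K @ np.linalg.inv(K + n * lam * np.eye(n))
bias_direct = np.sum((Q @ z - z) ** 2) / n
variance_direct = sigma2 / n * np.sum(Q**2)   # (sigma^2 / n) * ||Q||_F^2
```

Passing the kernel of a feature-selected matrix instead of K gives the risk in the reduced space, so the same helper can be used to compare the two risks empirically.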


Our next theorem bounds the risk inflation of ridge regression in the reduced-dimensional
space compared with the ridge regression solution in the full-feature space.

Theorem 4. Let $\rho$ be the rank of the matrix $X$. When using leverage-score sampling as
a feature selection technique, with probability at least $1 - \delta$, where $\delta \in (0, 1)$,

$$R(\tilde{K}) \le (1 - \Delta)^{-2} R(K),$$

where $\Delta = C\sqrt{\frac{\rho}{r} \log \frac{2\rho}{\delta}}$.

Proof. For any positive semi-definite matrix $K \in \mathbb{R}^{n \times n}$, we define the
bias $B(K)$ and variance $V(K)$ of the risk function as follows:

$$B(K) = n\lambda^2 z^T (K + n\lambda I_n)^{-2} z,$$
$$V(K) = \frac{\sigma^2}{n} \mathrm{Tr}\left(K^2 (K + n\lambda I_n)^{-2}\right).$$

Therefore, $R(K) = B(K) + V(K)$. Now due to Bach (2013) we know $B(K)$ is non-increasing
in $K$ and $V(K)$ is non-decreasing in $K$. When Lemma 9 holds,

$$R(\tilde{K}) = V(\tilde{K}) + B(\tilde{K})$$
$$\le V((1 + \Delta)K) + B((1 - \Delta)K)$$
$$\le (1 + \Delta)^2 V(K) + (1 - \Delta)^{-2} B(K)$$
$$\le (1 - \Delta)^{-2} \left(V(K) + B(K)\right)$$
$$= (1 - \Delta)^{-2} R(K).$$

The third inequality uses $(1 + \Delta)^2 \le (1 - \Delta)^{-2}$, which holds because
$(1 + \Delta)(1 - \Delta) = 1 - \Delta^2 \le 1$. □

We can prove a similar theorem for BSS.

Theorem 5. Let $\rho$ be the rank of the matrix $X$. When using BSS as a feature selection
technique, with $\Delta = C\sqrt{\rho/r}$,

$$R(\tilde{K}) \le (1 - \Delta)^{-2} R(K).$$


6 Experiments 

All experiments were performed in MATLAB R2013b on an Intel i-7 processor with 
16GB RAM. 

6.1 BSS Implementation Issues 

The authors of Batson et al. (2009) do not provide any implementation details of the
BSS algorithm. Here we discuss several issues arising during the implementation.
Choice of column selection: At every iteration, there are multiple columns which satisfy
the condition $\mathcal{U}(u_i, \delta_U, A_\tau, U_\tau) \le \mathcal{L}(u_i, \delta_L, A_\tau, L_\tau)$.
The authors of Batson et al. (2009) suggest picking any column which satisfies this
constraint. Instead of breaking ties arbitrarily, we choose the column $u_i$ which has not
been selected in previous iterations and whose Euclidean norm is highest among the candidate
set. Columns with zero Euclidean norm never get selected by the algorithm. In the inner loop
of Algorithm 1, $\mathcal{U}$ and $\mathcal{L}$ have to be computed for all the $d$ columns in
order to pick a good column. This step can be done efficiently using a single line of Matlab
code, by making use of matrix and vector operations.

6.2 Other Feature Selection Methods 

In this section, we describe other feature-selection methods with which we compare 
BSS. 


6.2.1 Rank-Revealing QR Factorization (RRQR)

Within the numerical linear algebra community, subset selection algorithms use the so-called
Rank Revealing QR (RRQR) factorization. Here we slightly abuse notation and
state $A$ as a short and fat matrix as opposed to the tall and thin matrix. Let $A$ be an
$n \times d$ matrix with $n < d$, let $k$ be an integer with $k < d$, and assume partial QR
factorizations of the form

$$AP = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix},$$

where $Q \in \mathbb{R}^{n \times n}$ is an orthogonal matrix, $P \in \mathbb{R}^{d \times d}$
is a permutation matrix, $R_{11} \in \mathbb{R}^{k \times k}$,
$R_{12} \in \mathbb{R}^{k \times (d-k)}$ and $R_{22} \in \mathbb{R}^{(n-k) \times (d-k)}$.
The above factorization is called a RRQR factorization if
$\sigma_{\min}(R_{11}) \ge \sigma_k(A)/p(k, d)$ and
$\sigma_{\max}(R_{22}) \le \sigma_{\min}(A)\, p(k, d)$, where
$p(k, d)$ is a function bounded by a low-degree polynomial in $k$ and $d$. The important
columns are given by

$$A_1 = Q \begin{pmatrix} R_{11} \\ 0 \end{pmatrix}$$

and $\sigma_i(A_1) = \sigma_i(R_{11})$ with $1 \le i \le k$.

We perform feature selection using RRQR by picking the important columns which
preserve the rank of the matrix.



6.2.2 Random Feature Selection

We select features uniformly at random without replacement, which serves as a baseline
method. To get around the randomness, we repeat the sampling process five times.


6.2.3 Leverage-Score Sampling

For leverage-score sampling, we repeat the experiments five times to get around the
randomness. We pick the top-$\rho$ left singular vectors of $X$, where $\rho$ is the rank of the
matrix $X$.


6.2.4 Information Gain (IG)

The Information Gain feature selection method (Yang and Pedersen, 1997) measures the 
amount of information obtained for binary class prediction by knowing the presence or 
absence of a feature in a dataset. The method is a supervised strategy, whereas the other 
methods used here are unsupervised. 
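
As a hedged illustration of the idea, the sketch below computes the information gain of a single binary presence/absence feature as the reduction in label entropy. The function names and the toy labels are assumptions; the formula is the standard entropy-based definition and may differ in detail from the exact variant used in the experiments.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label vector; 0 for an empty vector."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(present, y):
    """IG of a binary presence/absence feature with respect to the labels y."""
    present = np.asarray(present, dtype=bool)
    y = np.asarray(y)
    conditional = 0.0
    for mask in (present, ~present):
        # weight the entropy of each group by the fraction of points in it
        conditional += mask.mean() * entropy(y[mask])
    return entropy(y) - conditional

y = np.array([1, 1, -1, -1])
ig_perfect = information_gain([1, 1, 0, 0], y)   # feature mirrors the label
ig_useless = information_gain([1, 0, 1, 0], y)   # feature independent of the label
```

Ranking features by this score and keeping the top r is the supervised baseline; note that, unlike the other methods, it needs the labels y.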


6.3 Experiments on RLSC 

The goal of this section is to compare BSS with existing feature selection methods for 
RLSC and show that BSS is better than the other methods. 
Table 1: Most frequently selected features using the synthetic dataset.

r = 80        k = 90                 k = 100
BSS           89, 88, 87, 86, 85     100, 99, 98, 97, 95
RRQR          90, 80, 79, 78, 77     100, 80, 79, 78, 77
Lvg-Score     73, 85, 84, 81, 87     93, 87, 95, 97, 96
IG            80, 79, 78, 77, 76     80, 79, 78, 77, 76

r = 90        k = 90                 k = 100
BSS           90, 88, 87, 86, 85     100, 99, 98, 97, 96
RRQR          90, 89, 88, 87, 86     100, 90, 89, 88, 87
Lvg-Score     67, 88, 83, 87, 85     100, 97, 92, 48, 58
IG            90, 89, 88, 87, 86     90, 89, 88, 87, 86


Table 2: Running time of various feature selection methods in seconds. For synthetic data, the
running time corresponds to the experiment when r = 80 and k = 90 and is averaged over
ten ten-fold cross-validation experiments. For TechTC-300, the running time corresponds to the
experiment when r = 400 and is averaged over ten ten-fold cross-validation experiments and
over 48 TechTC-300 datasets.

                  BSS        IG        LVG       RRQR
Synthetic Data    0.1025     0.0003    0.0031    0.0016
TechTC-300        75.7624    0.0242    0.4054    0.2631


6.1 Synthetic Data 

We run our experiments on synthetic data where we control the number of relevant
features in the dataset and demonstrate the working of Algorithm 1 on RLSC. We generate
synthetic data in the same manner as given in Bhattacharyya (2004). The dataset has n
data-points and d features. The class label $y_i$ of each data-point was randomly chosen
to be 1 or -1 with equal probability. The first k features of each data-point $x_i$ are drawn
from a $y_i\,\mathcal{N}(-j, 1)$ distribution, where $\mathcal{N}(\mu, \sigma^2)$ is a random normal
distribution with mean $\mu$ and variance $\sigma^2$ and j varies from 1 to k. The remaining
$d - k$ features are chosen from a $\mathcal{N}(0, 1)$ distribution. Thus the dataset has k relevant
features and $(d - k)$ noisy features. By construction, among the first k features, the kth
feature has the most discriminatory power, followed by the $(k - 1)$th feature and so on.
We set n to 30 and d to 1000. We set k to 90 and 100 and ran two sets of experiments.
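
The generation procedure described above can be sketched as follows. This is an illustrative reading of the description, with data points stored as rows; the function name and seed are hypothetical.

```python
import numpy as np

def make_synthetic(n=30, d=1000, k=90, seed=0):
    """Generate n points with d features, of which the first k are relevant.

    Feature j (1-based, j <= k) of point i is y_i * N(-j, 1);
    the remaining d - k features are N(0, 1) noise.
    """
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)               # labels, equal probability
    X = rng.standard_normal((n, d))               # N(0, 1) everywhere to start
    means = -np.arange(1, k + 1)                  # mean -j for feature j
    X[:, :k] = y[:, None] * (X[:, :k] + means)    # y_i * N(-j, 1)
    return X, y
```

Because the mean of feature j grows in magnitude with j while the variance stays at 1, the kth feature separates the two classes most clearly.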

We set the value of r, i.e. the number of features selected by BSS, to 80 and 90 for
all experiments. We performed ten-fold cross-validation and repeated it ten times. The
value of λ was set to 0, 0.1, 0.3, 0.5, 0.7, and 0.9. We compared BSS with RRQR,
IG and leverage-score sampling. The mean out-of-sample error was 0 for all methods
for both k = 90 and k = 100. Table 1 shows the set of five most frequently selected
features by the different methods for one such synthetic dataset across 100 training sets.
The top features picked up by the different methods are the relevant features by
construction and also have good discriminatory power. This shows that BSS is as good
as any other method in terms of feature selection and often picks more discriminatory
features than the other methods. We repeated our experiments on ten different synthetic
datasets and each time, the five most frequently selected features were from the set of
relevant features. Thus, by selecting only 8%-9% of all features, we show that we are
able to obtain the most discriminatory features along with good out-of-sample error
using BSS.
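
The evaluation protocol (ten-fold cross-validation repeated ten times) can be sketched as below. The classifier is assumed to be the usual ridge-style RLSC, fit in its dual form and predicting with the sign of the linear score; the exact scaling of the regularizer in the paper may differ, feature selection inside each fold is omitted for brevity, and all names are hypothetical.

```python
import numpy as np

def rlsc_fit(X, y, lam):
    """Dual-form regularized least squares: w = X^T (X X^T + lam I)^{-1} y.
    Rows of X are data points; lam > 0 is assumed."""
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    return X.T @ alpha

def repeated_kfold_error(X, y, lam, folds=10, repeats=10, seed=0):
    """Mean misclassification rate over `repeats` runs of `folds`-fold CV."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    errors = []
    for _ in range(repeats):
        perm = rng.permutation(n)
        for test_idx in np.array_split(perm, folds):
            train_idx = np.setdiff1d(perm, test_idx)
            w = rlsc_fit(X[train_idx], y[train_idx], lam)
            pred = np.sign(X[test_idx] @ w)
            errors.append(np.mean(pred != y[test_idx]))
    return float(np.mean(errors))
```

With ten folds and ten repeats this yields the 100 training sets mentioned above.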
Table 3: Out-of-sample error of TechTC-300 datasets averaged over ten ten-fold
cross-validation and over 48 datasets for three values of r. The first and second entry of each cell
represents the mean and standard deviation. Items in bold indicate the best results.

r = 300      λ = 0.1         λ = 0.3         λ = 0.5         λ = 0.7
BSS          31.76 ± 0.68    31.46 ± 0.67    31.24 ± 0.65    31.03 ± 0.66
Lvg-Score    38.22 ± 1.26    37.63 ± 1.25    37.23 ± 1.24    36.94 ± 1.24
RRQR         37.84 ± 1.20    37.07 ± 1.19    36.57 ± 1.18    36.10 ± 1.18
Randomfs     50.01 ± 1.2     49.43 ± 1.2     49.18 ± 1.19    49.04 ± 1.19
IG           38.35 ± 1.21    36.64 ± 1.18    35.81 ± 1.18    35.15 ± 1.17

r = 400      λ = 0.1         λ = 0.3         λ = 0.5         λ = 0.7
BSS          30.59 ± 0.66    30.33 ± 0.65    30.11 ± 0.65    29.96 ± 0.65
Lvg-Score    35.06 ± 1.21    34.63 ± 1.20    34.32 ± 1.2     34.11 ± 1.19
RRQR         36.61 ± 1.19    36.04 ± 1.19    35.46 ± 1.18    35.05 ± 1.17
Randomfs     47.82 ± 1.2     47.02 ± 1.21    46.59 ± 1.21    46.27 ± 1.2
IG           37.37 ± 1.21    35.73 ± 1.19    34.88 ± 1.18    34.19 ± 1.18

r = 500      λ = 0.1         λ = 0.3         λ = 0.5         λ = 0.7
BSS          29.80 ± 0.77    29.53 ± 0.77    29.34 ± 0.76    29.18 ± 0.75
Lvg-Score    33.33 ± 1.19    32.98 ± 1.18    32.73 ± 1.18    32.52 ± 1.17
RRQR         35.77 ± 1.18    35.18 ± 1.16    34.67 ± 1.16    34.25 ± 1.14
Randomfs     46.26 ± 1.21    45.39 ± 1.19    44.96 ± 1.19    44.65 ± 1.18
IG           36.24 ± 1.20    34.80 ± 1.19    33.94 ± 1.18    33.39 ± 1.17
[Figure 1 panels: "TechTC-300, r = 300, λ = 0.5" and "TechTC-300, r = 300, λ = 0.3"; x-axis: Experiment #.]

Figure 1: Out-of-sample error of 48 TechTC-300 documents averaged over ten ten-fold
cross validation experiments for different values of regularization parameter λ and
number of features r = 300. Vertical bars represent standard deviation.

Though running time is not the main subject of this study, we computed the running
time of the different feature selection methods, averaged over ten ten-fold
cross-validation experiments. Each method took less than a second on average to perform
feature selection (See Table 2), which shows that the methods can be implemented in
practice.

6.2 TechTC-300 

We use the TechTC-300 data (Davidov et al., 2004), consisting of a family of 295
document-term data matrices. The TechTC-300 dataset comes from the Open Directory
Project (ODP), which is a large, comprehensive directory of the web, maintained
by volunteer editors. Each matrix in the TechTC-300 dataset contains a pair of
categories from the ODP. Each category corresponds to a label, and thus the resulting
classification task is binary. The documents that are collected from the union of all
the subcategories within each category are represented in the bag-of-words model, with
the words constituting the features of the data (Davidov et al., 2004). Each data
matrix consists of 150-280 documents, and each document is described with respect to
10,000-50,000 words. Thus, TechTC-300 provides a diverse collection of data sets for
a systematic study of the performance of the RLSC using BSS. We removed all words
of length at most four from the datasets. Next we grouped the datasets based on the
categories and selected those datasets whose categories appeared at least thrice. There
were 147 datasets, and we performed ten-fold cross validation and repeated it ten times
on 48 such datasets. We set the values of the regularization parameter of RLSC to
0.1, 0.3, 0.5 and 0.7.

Figure 2: Out-of-sample error of 48 TechTC-300 documents averaged over ten ten-fold
cross validation experiments for different values of regularization parameter λ and
number of features r = 400 and r = 500. Vertical bars represent standard deviation.

We set r to 300, 400 and 500. We report the out-of-sample error for all 48 datasets.
BSS consistently outperforms leverage-score sampling, IG, RRQR and random feature
selection on all 48 datasets for all values of the regularization parameter. Table 3
and Fig 1 show the results. The out-of-sample error decreases with increase in number
of features for all methods. In terms of out-of-sample error, BSS is the best, followed
by leverage-score sampling, IG, RRQR and random feature selection. BSS is at least
3%-7% better than the other methods when averaged over 48 document matrices. From
Fig 1 and 2, it is evident that BSS is comparable to the other methods and often better on
all 48 datasets. Leverage-score sampling requires a greater number of samples to achieve
the same out-of-sample error as BSS (See Table 3, r = 500 for Lvg-Score and r = 300
for BSS). Therefore, for the same number of samples, BSS outperforms leverage-score
sampling in terms of out-of-sample error. The out-of-sample error of supervised IG is
worse than that of unsupervised BSS, which could be due to the worse generalization of
the supervised IG metric. We also observe that the out-of-sample error decreases with
increase in λ for the different feature selection methods.

Table 4: A subset of the TechTC matrices of our study.

id1_id2         id1                               id2
1092_789236     Arts:Music:Styles:Opera           US Navy:Decommisioned Submarines
17899_278949    US:Michigan:Travel & Tourism      Recreation:Sailing Clubs:UK
17899_48446     US:Michigan:Travel & Tourism      Chemistry:Analytical:Products
14630_814096    US:Colorado:Localities:Boulder    Europe:Ireland:Dublin:Localities
10539_300332    US:Indiana:Localities:S           Canada:Ontario:Localities:E
10567_11346     US:Indiana:Evansville             US:Florida:Metro Areas:Miami
10539_194915    US:Indiana:Localities:S           US:Texas:Localities:D

Table 5: Frequently occurring terms of the TechTC-300 datasets of Table 4 selected by BSS

id1_id2         words
1092_789236     naval, shipyard, submarine, triton, music, opera, libretto, theatre
17899_278949    sailing, cruising, boat, yacht, racing, michigan, leelanau, casino
17899_48446     vacation, lodging, michigan, asbestos, chemical, analytical, laboratory
14630_814096    ireland, dublin, boulder, colorado, lucan, swords, school, dalkey
10539_300332    ontario, fishing, county, elliot, schererville, shelbyville, indiana, bullet
10567_11346     florida, miami, beach, indiana, evansville, music, business, south
10539_194915    texas, dallas, plano, denton, indiana, schererville, gallery, north

Table 6: Frequently occurring terms of the TechTC-300 datasets of Table 4 selected by
Leverage-Score Sampling

id1_id2         words
1092_789236     sturgeon, seawolf, skate, triton, frame, opera, finback
17899_278949    sailing, yacht, laser, michigan, breakfast, county, clear
17899_48446     analysis, michigan, water, breakfast, asbestos, environmental, analytical
14630_814096    ireland, dublin, estate, lucan, dalkey, colorado, boulder
10539_300332    library, fishing, service, lodge, ontario, elliot, indiana, shelbyville
10567_11346     evansville, services, health, church, south, bullet, florida
10539_194915    dallas, texas, schererville, indiana, shelbyville, plano

We list the most frequently occurring words selected by BSS and leverage-score
sampling for the r = 300 case for seven TechTC-300 datasets over 100 training sets used
in the cross-validation experiments. Table 4 shows the names of the seven TechTC-300
document-term matrices. The words shown in Tables 5 and 6 were selected in all
cross-validation experiments for these seven datasets. The words are closely related to the
categories to which the documents belong, which shows that BSS and leverage-score
sampling select important features from the training set. For example, for the document-pair
(1092_789236), where 1092 belongs to the category of “Arts:Music:Styles:Opera” and
789236 belongs to the category of “US:Navy:Decommisioned Submarines”, the BSS
algorithm selects submarine, shipyard, triton, opera, libretto, theatre, which are closely
related to the two classes. The top words selected by leverage-score sampling for the
same document-pair are seawolf, sturgeon, opera, triton, finback, which are closely
related to the class. Another example is the document-pair 10539_300332, where 10539
belongs to “US:Indiana:Localities:S” and 300332 belongs to the category of
“Canada:Ontario:Localities:E”. The top words selected by BSS for this document-pair are
ontario, elliot, shelbyville, indiana, schererville, which are closely related to the class
values. The top words selected by leverage-score sampling are library, fishing, elliot,
indiana, shelbyville, ontario, which are closely related to the class. Thus, we see that
using only 2%-4% of all features we are able to select relevant features and obtain good
out-of-sample error.

Though feature selection is an offline task, we discuss the running times of the
different methods to highlight that BSS can be implemented in practice. We
computed the running time of the different feature selection methods averaged over ten
ten-fold cross validation experiments and over 48 datasets (See Table 2). The average
time for feature selection by BSS is a little over a minute, while the rest of the
methods take less than a second. This shows that BSS can be implemented in practice
and can scale up to reasonably large datasets with 20,000-50,000 features. For BSS and
leverage-score sampling, the running time includes the time to compute the SVD of the
matrix. BSS takes approximately a minute to select features, but is at least 3%-7%
better in terms of out-of-sample error than the other methods. IG takes less than a second
to select features, but is 4%-7% worse than BSS in terms of out-of-sample error.


6.4 Experiments on Ridge Regression in the fixed design setting 

In this section, we describe experiments on feature selection on ridge regression in the 
fixed design setting using synthetic and real data. 


[Figure 3 panels: "MSE/Risk for BSS" and "MSE/Risk for Leverage-Score Sampling", for k = 90 and k = 100; x-axis: lambda = 0.1, 0.3, 0.5, 0.7; legend: r = 6*n, r = 7*n, r = 8*n, r = 9*n, full.]

Figure 3: MSE/Risk for synthetic data for k = 90 and k = 100 using different feature
selection methods as a function of λ. The risk after feature selection is comparable to
the risk of full-data.
6.1 Synthetic Data


We generate the features of the synthetic data X in the same manner as described in
Section 6.1. We generate $\beta \sim \mathcal{N}(0, 1)$ and $y = X^T\beta + \omega$, where
$\omega \in \mathbb{R}^n$ and $\beta \in \mathbb{R}^d$.
We set n to 30 and d to 1000. We set the number of relevant features, k, to 90 and 100
and ran two sets of experiments. We set the value of r, i.e. the number of features
selected by BSS and leverage-score sampling, to t * n, where t = 6, 7, 8, 9 for both
experiments. The value of λ was set to 0.1, 0.3, 0.5 and 0.7. We compared the risk
of ridge regression using BSS and leverage-score sampling with the risk of full-feature
selection and report the MSE/Risk in the fixed design setting as a measure of accuracy.
Fig 3 shows the risk of synthetic data for both BSS and leverage-score sampling as a
function of λ. The risk of the sampled data is comparable to the risk of the full-data in
most cases, which follows from our theory. We observe that for higher values of λ, the
risk of sampled space becomes worse than that of full-data for both BSS and
leverage-score sampling. The risk in the sampled space is almost the same for both BSS and
leverage-score sampling. The time to compute feature selection is less than a second
for both methods (See Table 7).
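
For intuition, the fixed-design risk of ridge regression can be evaluated in closed form as squared bias plus variance. The sketch below assumes the common definition, risk = (1/n) E||ŷ - μ||², with true mean vector μ, i.i.d. noise of variance `sigma2`, and an unscaled regularizer λ; these conventions, the row-wise data layout and the function name are assumptions and may differ from the definitions used elsewhere in the paper.

```python
import numpy as np

def ridge_fixed_design_risk(Z, mu, lam, sigma2=1.0):
    """Fixed-design risk (1/n) * E||H y - mu||^2 of ridge regression.

    Z   : n x p design matrix (rows are data points), full or feature-sampled
    mu  : length-n vector of noiseless responses
    lam : ridge parameter (> 0)
    """
    n = Z.shape[0]
    K = Z @ Z.T
    H = K @ np.linalg.inv(K + lam * np.eye(n))   # hat matrix: y_hat = H y
    bias2 = np.sum((H @ mu - mu) ** 2)           # ||(H - I) mu||^2
    variance = sigma2 * np.sum(H ** 2)           # sigma^2 * trace(H H^T)
    return (bias2 + variance) / n
```

One way to compare the full and sampled spaces under these assumptions is to call the function once with all columns of the design matrix and once with only the selected (and rescaled) columns, keeping `mu` fixed.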

Table 7: Running time of various feature selection methods in seconds. For synthetic data, the
running time corresponds to the experiment when r = 8n. For TechTC-300, the running time
corresponds to the experiment when r = 400.

       Synthetic Data    TechTC (10341-14525)    TechTC (10341-61792)
BSS    0.3368            68.8474                 67.013
LVG    0.0045            0.3994                  0.3909
[Figure 4 panels include "10341-61792 BSS" and "10341-14525 Lvg"; x-axis: λ = 0.1, 0.3, 0.5; legend: r = 300, r = 400, r = 500, full.]

Figure 4: MSE/Risk for TechTC-300 data using different feature selection methods as
a function of λ. The risk after feature selection is comparable to the risk of full-data.


6.2 TechTC-300 


We use two TechTC-300 datasets, namely “10341-14525” and “10341-61792”, to
illustrate our theory. We add Gaussian noise to the labels. We set the value of r, the number
of features to be selected, to 300, 400 and 500. The value of λ was set to 0.1, 0.3 and 0.5.
We compared the risk of ridge regression using BSS and leverage-score sampling with
the risk of full-feature selection and report the MSE/Risk in the fixed design setting as a
measure of accuracy. Fig 4 shows the risk of real data for both BSS and leverage-score
sampling as a function of λ. The risk of the sampled data is comparable to the risk of
the full-data in most cases, which follows from our theory. The risk of the sampled data
decreases with increase in r. The time to perform feature selection is approximately a
minute for BSS and less than a second for leverage-score sampling (See Table 7).
7 Conclusion


We present a provably accurate feature selection method for RLSC which works well
empirically and also gives better generalization performance than prior existing methods.
The number of features required by BSS is of the order O , which makes the result
tighter than that obtained by leverage-score sampling. BSS has been recently used
as a feature selection technique for k-means clustering (Boutsidis and Magdon-Ismail,
2013), linear SVMs (Paul et al., 2015) and our work on RLSC helps to expand research
in this direction. The risk of ridge regression in the sampled space is comparable to
the risk of ridge regression in the full feature space in the fixed design setting and we
observe this in both theory and experiments. An interesting future work in this direction
would be to include feature selection for non-linear kernels with provable guarantees.

Acknowledgements. Most of the work was done when SP was a graduate student at
RPI. This work is supported by NSF CCF 1016501 and NSF IIS 1319280.